(57) Abstract

A video conferencing system and method which automatically determines
the appropriate preset camera parameters corresponding to participants
participating in the video conference. A camera zooms out or pans the video
conference space and looks for participants based on their faces. When a participant
is detected, the preset camera parameters for that participant are calculated for
when the center of the participant is in the center of the camera's view. This
is continued for all the participants in the room. The optimal position for each
participant and corresponding camera parameters are determined based on
cultural preferences. Updates in the presets can be made periodically by the camera
zooming out or panning the room. Multiple cameras can be used to continually
update the presets.
POSITIONS CORRESPONDING TO PARTICIPANTS IN


This invention relates to the field of video conferencing technology and 
specifically to a method for automatically determining the appropriate pan, tilt, and zoom 
parameters of a camera which correspond to desired views of participants in a video 
conference setting. 

During a video conference it is necessary to know the appropriate camera
parameters for each participant so that the view of the camera can change quickly from one
participant to another. These parameters include the appropriate zoom, pan and tilt of the
camera, and will collectively be referred to as the camera "parameters", with the values of
these parameters associated with each participant being the "presets". While the conference is
occurring, users need to be able to view different participants quickly, frequently
changing from one participant to another in a small amount of time.

Prior art devices require a user to manually set the camera parameters for each
participant involved in the video conference. Each camera being used is focused on a
participant and a preset switch is actuated. For example, if there are three people in the
conference, switch 1 is used to represent the appropriate camera parameters for participant 1;
switch 2 for participant 2; and switch 3 for participant 3. When a user desires to switch the
view between participant 1 and 2, he only needs to activate switch 2 and the camera is moved
and focused accordingly. However, setting a camera for each participant is frequently a
tedious process requiring a commitment of time by the camera operator or user. Additionally,
every time a participant leaves or enters the room, the presets have to be readjusted
accordingly. If a participant merely moves from his original location, the original camera
presets will no longer apply. Clearly this is a problem if a participant moves from one location
to another within the room. However, even if the participant moves within his own chair (i.e.
forward, backward, leaning toward one side, etc.) the parameters may change and that
participant may no longer be in focus, in the center of the camera's view, or of the desired size
with respect to the camera's view.

In U.S. Patent 5,598,209, a user can point to an object or person he wishes to
view and the system automatically stores the pan and tilt parameters of the camera relating to
the center of that object. However, all of the objects or persons in the room have to be
affirmatively selected and stored under control of a user, which again is time consuming. There
also is no provision for updating the parameters when a participant leaves or enters the room.

The ability to automatically determine preset positions is useful in a congress
layout as well. Generally, in these types of rooms, the camera presets are based upon the
microphone being used for each individual. When a participant turns on his microphone, the
camera presets that relate to the position of that microphone are used. This is problematic
because if the microphone does not work or if one particular microphone is used by another
speaker, the appropriate correlation between speaker and camera view would not occur.

Therefore, there exists a need for a video conferencing system which
automatically determines the appropriate camera parameters for all participants and which
also adjusts itself as participants enter and leave the room. The goal of a video conference is
effective communication and conversation. If a user continually has to readjust the system to
initialize or update preset parameters, this goal is frustrated. The conversation dynamic
between end users is different from that of a production (as in a television show). To facilitate
this dynamic, it is desirable to automate as much of the system as is possible without resorting
to a static zoomed out view which would yield less meaningful communication.




One aspect of the invention is a method of calculating presets of camera
parameters corresponding to participants in a video conferencing system. The method includes
providing a camera having tilt, pan, and zoom parameters, and defining a space based upon a
layout of the video conferencing system. The method further includes performing one of
moving the camera through all pertinent panning values, the pertinent panning values being
defined by the space in which the video conferencing system is located, and zooming the
camera out so that all possible participants can be viewed by the camera and so that a location
of each participant in the space can be determined. The method further provides for detecting
participants within the space and calculating the presets corresponding to the participants, the
presets defining a camera view, the presets being based upon at least one of an optimal
position of the participants in the camera view, an alignment of the center of a head of the
participants with a center of the camera view, and an alignment of a center of a participant
with the center of the camera view.

This aspect, like the ones following, allows for the automatic detection and 
update of camera parameters corresponding to participants in a video conference. 


WO 00/38414 PCT/EP99/10066

According to another aspect of the invention, a video conferencing system
comprises at least one camera having pan, tilt, and zoom parameters. The parameters have
preset values assigned to corresponding participants of the video conferencing system. Each of
the presets defines a camera view and is determined by: one of panning and zooming the
camera throughout a space defined by the video conferencing system, detecting a participant,
and defining a preset based on a camera position which would place the participant in one of
an optimal position, a position where a head of the participant is in alignment with a center of
the camera's view, and a position where a center of the participant is aligned with the center of
the camera's view.

According to yet another aspect of the invention, a video conferencing system
comprises at least one camera having pan, tilt, and zoom parameters. The parameters have
preset values assigned to corresponding participants of the video conferencing system, the
presets defining a camera view. The system further includes at least one of panning means for
panning the camera throughout a space defined by the video conferencing system, and
zooming means for zooming the camera out to thereby allow the camera to view the space
defined by the video conferencing system. A detecting means is used for detecting participants
in the space. A determination means is used for determining presets of the camera based on a
camera position which would place one of the participants in one of an optimal position, a
position where a head of the participant is in alignment with a center of said camera's view,
and a position where a center of the participant is aligned with the center of the camera's view.

It is an object of the invention to provide a video conferencing system and
method which can automatically determine the presets for camera parameters relating to
appropriate views of participants.

It is another object of the invention to provide a video conferencing system and
method which can continually update camera presets in accordance with changes in the
number and location of participants.

These objects, as well as others, will become more apparent from the following 
description read in conjunction with the accompanying drawings where like reference 
numerals are intended to designate the same elements. 


Figs. 1A, 1B and 1C are diagrams of room, congress, and table layouts
respectively of a video conferencing system in accordance with the invention;


Figs. 2A, 2B and 2C are diagrams showing a participant coming into a camera's
view as the camera pans a room in a video conferencing system according to the invention;

Fig. 3 is a perspective model of a camera used in the invention;

Fig. 4 is a diagram showing participants in a video conference with respective
temporary presets indicated;

Fig. 5 is a diagram showing the center of a participant offset from the center of
the camera's view of that participant;

Fig. 6 is a diagram showing participants in a video conference with respective
updated presets indicated;

Fig. 7 is a diagram showing an alternate embodiment of the invention using two
cameras;

Fig. 8 is a diagram of a cylindrical coordinate system used for graphing colors
of pixels in images;

Fig. 9 is three graphs representing projections of the YUV color domain
indicating the areas where skin colored pixels lie;

Figs. 10A-10F are original images and respective binary images, the binary
images being formed by segregating pixels based on color;

Fig. 11 is a diagram illustrating how a 3x3 mask is used as part of luminance
variation detection in accordance with the invention;

Figs. 12A and 12B are diagrams illustrating 4 and 8 type connectivity
respectively;

Figs. 13A and 13B are images showing what the image of Figs. 10C and 10E
would look like after the edges are removed in accordance with the invention;

Fig. 14 is a diagram showing examples of bounding boxes applied to the image
of Fig. 10F;

Fig. 15 is a sequence of diagrams showing how components of an image are
represented by vertices and connected to form a graph in accordance with the invention;

Figs. 16A - 16D are a sequence of images illustrating the application of a
heuristic according to the invention; and

Fig. 17 is a flow chart detailing the general steps involved in face detection. 




In Fig. 1A, a video conference system is shown where the participants are
seated around a table. Fig. 1B shows the participants in a congress style arrangement. A
camera 50 is controlled by a controller 52 to pan from one side of the room to the other.
Clearly, the panning movement can begin and end in the same place. For example, as is shown
in Fig. 1C, camera 50 could be disposed in the middle of a room with the participants located
all around it. In this type of a situation, camera 50 would rotate completely in a circle in order
to completely pan the entire room. In the congressional arrangement shown in Fig. 1B, camera
50 might make multiple panning paths to cover the different rows. Each one of those paths
would have a different tilt and probably a different zoom (although the zoom may be the same
if participants are placed directly above one another at substantially the same radial distance
from the camera). Again, in the congressional arrangement, camera 50 could be disposed in
the center of the room and then the panning movement may require a complete rotation as was
shown in Fig. 1C.

For simplicity, the arrangement shown in Fig. 1A will now be further described,
although it should be apparent that the same ideas would apply to all of the arrangements
mentioned and also other arrangements apparent to those skilled in the art. The invention will
work for any space defined by the adjustability of the video conferencing system. Three
participants (PartA, PartB, and PartC) are shown but, again, more participants could be
involved.

As camera 50 pans from one side of the room to the other, participants will
appear to move across and through the camera's view. As shown in Figs. 2A - 2C, a participant
appears at different portions of the camera's view depending on the camera's pan position. As
can also be discerned from the figure, for three different pan positions (P1, P2, P3) the tilt (T)
and zoom (Z) remain the same. It is also possible that during the initial camera scan, one of the
other parameters (i.e. tilt or zoom) could be moved through an appropriate range while the
remaining two parameters are kept constant. Another possibility is if camera 50 had its zoom
parameter set so that the entire room could be seen at once (assuming enough information can
be gleaned to determine the position of stationary participants as is discussed more clearly
below). Again, for simplicity, the camera panning idea will be described but it should be
apparent that the other suggestions could be implemented with appropriate changes that would
be clear to those skilled in the art.

During the initial panning, each frame which the camera processes is analyzed
to determine whether a participant is disposed within the frame. One method for making this
determination is detailed below in the participant detection section. Clearly, other methods
could be implemented. For each participant that is detected, a panning camera will detect a
multiplicity of frames which would include that participant. For example, if a camera
processes many frames during a pan, this could be misread as the same number of
participants if a participant is shown in each frame.

To avoid this problem of multiplying the actual number of participants, each
detected participant is labeled. The center of mass for each detected participant is calculated
for each processed frame. Then, a second successive frame containing potential participants
is compared to the previous first frame to see if the camera is viewing a new participant or
just another frame which includes the same participant. One method for effectuating this
comparison is to perform a geometric extrapolation based on the first center and the amount
that the camera has moved from the first position. This would yield approximately where the
center should be if the second frame contains the same participant as the first frame. Similarly,
the center of mass of the second frame could be computed and then compared to the first
center along with the known movement of the camera between the position where the first
frame is viewed and the position where the second frame is viewed. Alternatively, a signature
could be created for each detected participant and then the signatures of participants in
successive frames could be compared to that initial signature. Signatures are known in the art.
Some examples of signature techniques are discussed below in the participant identification
and position update section. Once it is determined that the image of a participant is disposed
within a frame, temporary presets can be calculated.

Referring to Fig. 3, a perspective model of a camera is shown. A sensor 56 of
the camera has a principal point PP having an x and y coordinate PPx and PPy respectively. A
lens has a center which is disposed at a focal length f from principal point PP. A change in
zoom of the camera is effectuated by a change in the focal distance f. A shorter f means a
wide view ("zooming out"). A change in the pan parameter is effectively a rotation of the
sensor about the pan axis. A change in the tilt parameter is a rotation of the sensor about the
tilt axis.
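
The perspective model can be illustrated with a small worked example. The sketch below assumes an ideal pinhole camera whose focal length is expressed in pixels and treats pan and tilt as independent small rotations; the function name and the sample numbers are illustrative assumptions and do not come from the specification. It estimates how far the camera would have to pan and tilt to bring an image point onto the principal point.

```python
import math

def centering_offsets(x, y, ppx, ppy, f):
    """Return (pan, tilt) in degrees that would move image point (x, y)
    onto the principal point (ppx, ppy).

    Assumes a pinhole model with focal length f in pixels; this helper is
    hypothetical and only approximates a real pan/tilt mechanism."""
    pan = math.degrees(math.atan2(x - ppx, f))
    tilt = math.degrees(math.atan2(y - ppy, f))
    return pan, tilt

# A point 500 pixels right of the principal point with f = 500 pixels
# needs roughly a 45 degree pan; a shorter f (wider view) needs more.
pan, tilt = centering_offsets(820, 240, 320, 240, 500)
```

With a shorter focal length the same pixel offset corresponds to a larger angle, which matches the statement that a shorter f gives a wider view.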

As an object or participant 62 comes into the field of view of the camera, the
location of that participant in space can be determined using conventional methods if two
frames containing that participant are available. This is because the location of principal point
PP (now shown at 0,0) and focus f are known. When camera 50 pans a room, it acquires
multiple frames containing participants and so the location of each participant in space can be
determined. [...] be needed to determine the location. Once the location of a participant is
known, the temporary preset can be calculated by a processor 54 (Figs. 1A-1C).
To calculate the temporary preset, the center of the participant is determined, as
above for participant labeling, using known techniques. For example, the average of the
outline of the participant and its center of mass can be calculated. The center point is then
placed in the center of the camera's view to produce, for example, presets Psa, Tsa, and Zsa for
PartA in Fig. 1. These panning and preset calculation processes are repeated for all participants
in the room and, consequently, also determine how many participants are initially in the
room. This is all performed during an initiation portion of the conference and can later be
repeated during an update routine as is described more fully below.

Once all of the participants in the room are labeled and all the temporary
parameters are calculated as is shown in Fig. 4, camera 50 performs a second panning (or
zooming out) of the room. Each preset view is further refined because the calibration
performed in the initial panning phase will generally not be accurate enough.

As shown in Fig. 5, the center of the camera's view is compared to the center of
the head of each respective participant. The parameters are adjusted so that in the camera's
view, the centers align. Once the preset is refined, the preset corresponding to an "optimal"
view of each participant is calculated. This may be different depending on the societal
cultures. For example, the head and torso of a participant can take up anywhere from 30-60%
of the entire frame, as in a news program in the United States. The optimal view produces
updated presets Psn', Tsn' and Zsn' as is shown in Fig. 6. These values are continuously
updated depending on how the system is structured and how the updates are to be performed
as is explained below. If a camera is looking at one participant and that participant moves, the
new optimal position would be calculated and the camera preset will be continually adjusted
accordingly.

The camera can focus on participants based on audio tracking, video tracking, a
selection made by a user, or by any other technique known in the art. Audio tracking alone is
limited because it decreases in accuracy as people get further away, and it cannot be used by
itself because it generally has a 4-5 degree error and there can be no tracking when a
participant stops talking.

A name can be associated with each participant once he is detected. For
example, the three participants of Fig. 1 could be identified A, B, and C so that a user could
merely indicate that he wishes to view participant A and the camera will move to the optimized
preset for A. Additionally, the system could be programmed to learn something specific about
each participant and thus label that participant. For example, a signature could be created for
each participant, the color of the person's shirt, a voice pattern could be taken, or a
combination of the face and voice could be used to form the label associated with a
participant. With this extra information, if participant A moves around the room, the system
will know which participant is moving and will not be confused by participant A walking
through the view corresponding to parameters for participant B. Moreover, if two participants
are located close enough to one another so that they share a camera's view, the two participants
can be considered as one participant with the camera focusing on the center of the combination
of their images.

A benefit of this system is that it allows for the presets to be
automatically adjusted as the dynamics of the room's participants change. Clearly, if a preset is
selected and the corresponding participant has left the room, the system will sense this and
update the presets. Another method of updating is that every time a new preset is selected,
camera 50 will zoom out (or pan the room) to see if any people have come into or left the
room and update the presets before camera 50 moves to the selected preset. Camera 50 could
be controlled to periodically, even while it is instructed to view a selected participant,
temporarily stop viewing that participant, and pan the room or zoom out to see if the number
[...] have been. For example, if camera 50 is told to move from the preset for participant C
to participant A (Fig. 1), [...] and make the appropriate adjustments. Yet another technique of
updating involves camera 50 panning through the room (or zooming out) either periodically or
every time a new preset is selected.

Referring to Fig. 7, an alternate embodiment is shown. This embodiment shows
the same features as those in Fig. 1A except that a second camera 64 is added. The initial
calibration is performed the same as was described above. During the conference, however,
one camera is used to focus on the pertinent participant while the other is used to continuously
update the presets. The updating camera can continually be zoomed out so that it can
determine when a participant leaves or enters the room. Alternatively, the updating camera
could continually pan the room and make appropriate updates to the presets. The two cameras
share the preset information through, for example, processor 54. Clearly, more cameras could
be used. For example, one camera could be allocated for each individual that is planned to be
at the meeting and then an additional camera could be used as the updating camera.

One way of determining whether a participant is located within a camera's view
is to determine whether there is a face disposed within the image being viewed by the camera.
Each pixel in an image is generally represented in the HSV (hue, saturation, value) color
domain. These values are mapped onto a cylindrical coordinate system as shown in Fig. 8
where P is the value (or luminance), θ is the hue, and r is the saturation. Due to the
non-linearity of cylindrical coordinate systems, other color spaces are used to approximate the
HSV space. In the present applications, the YUV color space is used because most video
material stored on a magnetic medium and the MPEG2 standard both use this color space.

Transforming an RGB image to the YUV domain, and further projecting into
the VU, VY, and UY planes, produces graphs like those shown in Fig. 9. The circle segments
represent the approximation of the HSV domain. When pixels corresponding to skin color are
graphed in the YUV space, they generally fall into those circle segments shown. For example,
when the luminance of a pixel has a value between 0 and 200, the chrominance U generally
has a value between -100 and 0 for a skin colored pixel. These are general values based on
experimentation. Clearly, a color training operation could be performed for each camera being
used. The results of that training would then be used to produce more precise skin colored
segments.

To detect a face, each pixel in an image is examined to discern whether it is
skin colored. Those pixels which are skin colored are grouped from the rest of the image and
are thus retained as potential face candidates. If at least one projection of a pixel does not fall
within the boundaries of the skin cluster segment, the pixel is deemed not skin colored and
removed from consideration as a potential face candidate.
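
The per-pixel test can be sketched as follows. This is a minimal illustration, not the patented classifier: it replaces the circle segments of Fig. 9 with simple rectangular bounds, and while the Y and U ranges echo the example values in the text, the V range and all names are assumptions made for the sketch.

```python
# Hypothetical rectangular bounds standing in for the circle segments of Fig. 9.
Y_RANGE = (0, 200)     # luminance range quoted in the text
U_RANGE = (-100, 0)    # chrominance U range quoted in the text
V_RANGE = (0, 100)     # assumption: not given in the source

def in_range(value, bounds):
    low, high = bounds
    return low <= value <= high

def is_skin(y, u, v):
    """A pixel stays a face candidate only if every projection falls
    inside its (assumed) skin cluster."""
    projections = [
        in_range(v, V_RANGE) and in_range(u, U_RANGE),  # VU plane
        in_range(v, V_RANGE) and in_range(y, Y_RANGE),  # VY plane
        in_range(u, U_RANGE) and in_range(y, Y_RANGE),  # UY plane
    ]
    return all(projections)

def skin_mask(yuv_image):
    """Map a 2-D list of (Y, U, V) tuples to a binary image (1 = skin)."""
    return [[1 if is_skin(*px) else 0 for px in row] for row in yuv_image]
```

A per-camera color training step, as the text suggests, would replace the fixed bounds with measured cluster boundaries.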

The resultant image formed by the skin color detection is binary because it
shows either portions of the image which are skin color or portions which are not skin color as
shown in Figs. 10B, 10D, and 10F which correspond to original images in Figs. 10A, 10C, and
10E. In the figures, white is shown for skin color and black for non-skin color. As shown in
Figs. 10A and 10B, this detecting step alone may rule out large portions of the image as
having a face disposed within it. Prior art techniques which use color and shape may thus work
for simple backgrounds like that shown in Fig. 10A. However, looking at Figs. 10C and 10D
and Figs. 10E and 10F, it is clear that detection by color and shape alone may not be sufficient
to detect the faces. In Figs. 10C-10F, objects in the background like leather, wood, clothes,
and hair have colors similar to skin. As can be seen in Figs. 10D and 10F, these skin colored
objects are disposed immediately adjacent to the skin of the faces and so the faces themselves
are difficult to detect.

After the pixels are segregated by color, the pixels located on edges are
removed from consideration. An edge is a change in the brightness level from one pixel to the
next. The removal is accomplished by taking each skin colored pixel and calculating the
variance in the pixels around it in the luminance component; a high variance being indicative
of an edge. As is shown in Fig. 11, a box ("window") the size of either 3x3 or 5x5 pixels is
placed on top of a skin colored pixel. Clearly, other masks besides a square box could be used.

The variance is defined as

$$\sigma_x^2 = \frac{1}{n} \sum_{i=1}^{n} (x_i - \mu_x)^2$$

where $\mu_x$ is the average of all n pixels $x_i$ in the examined window. A "high" variance
level will be different depending upon the face and the camera used. Therefore, an iterative
routine is used starting with a very high variance level and working down to a low variance level.
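
The windowed variance test can be sketched as below. The sketch assumes the variance is the mean squared deviation of the luminance values in the window and that pixels falling outside the image are simply skipped; neither the border handling nor the function names are taken from the specification.

```python
def window_variance(lum, row, col, size=3):
    """Variance of luminance in a size x size window centred on (row, col).

    Pixels outside the image are skipped (an assumption; the source does
    not say how borders are handled)."""
    half = size // 2
    values = [
        lum[r][c]
        for r in range(row - half, row + half + 1)
        for c in range(col - half, col + half + 1)
        if 0 <= r < len(lum) and 0 <= c < len(lum[0])
    ]
    mean = sum(values) / len(values)
    return sum((v - mean) ** 2 for v in values) / len(values)

def remove_edges(mask, lum, threshold, size=3):
    """Clear skin pixels whose neighbourhood variance exceeds the threshold."""
    return [
        [1 if mask[r][c] and window_variance(lum, r, c, size) <= threshold else 0
         for c in range(len(mask[0]))]
        for r in range(len(mask))
    ]
```

The iterative routine described in the text would call `remove_edges` repeatedly with a decreasing `threshold`, examining the surviving components after each pass.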

At each step of the variance iteration, pixels are removed from facial
consideration if the variance in a window around the skin colored pixel is greater than the
variance threshold being tested for that iteration. After all of the pixels are examined in an
iteration, the resulting connected components are examined for facial characteristics as is
described more fully below. Connected components are pixels which are of the same binary
value (white for facial color) and connected. Connectivity can be either 4 or 8 type
connectivity. As shown in Fig. 12A, for 4 type connectivity, the center pixel is considered
"connected" to only pixels directly adjacent to it as is indicated by the "1" in the adjacent
boxes. In 8 type connectivity, as is shown in Fig. 12B, pixels diagonally touching the center
pixel are also considered to be "connected" to that pixel.
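
The difference between 4 and 8 type connectivity can be shown with a short labeling routine. This is a generic breadth-first flood fill offered as a sketch; the specification does not prescribe a particular labeling algorithm, and the names are illustrative.

```python
from collections import deque

NEIGHBOURS_4 = [(-1, 0), (1, 0), (0, -1), (0, 1)]
NEIGHBOURS_8 = NEIGHBOURS_4 + [(-1, -1), (-1, 1), (1, -1), (1, 1)]

def connected_components(mask, connectivity=4):
    """Return a list of components, each a list of (row, col) pixels set to 1."""
    offsets = NEIGHBOURS_4 if connectivity == 4 else NEIGHBOURS_8
    rows, cols = len(mask), len(mask[0])
    seen = [[False] * cols for _ in range(rows)]
    components = []
    for r in range(rows):
        for c in range(cols):
            if not mask[r][c] or seen[r][c]:
                continue
            seen[r][c] = True
            queue, pixels = deque([(r, c)]), []
            while queue:
                cr, cc = queue.popleft()
                pixels.append((cr, cc))
                for dr, dc in offsets:
                    nr, nc = cr + dr, cc + dc
                    if (0 <= nr < rows and 0 <= nc < cols
                            and mask[nr][nc] and not seen[nr][nc]):
                        seen[nr][nc] = True
                        queue.append((nr, nc))
            components.append(pixels)
    return components
```

Two pixels that touch only at a corner form two components under 4 type connectivity and a single component under 8 type connectivity.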

As stated above, after each iteration, the connected components are examined in
a component classification step to see if they could be a face. This examination involves
looking at 5 distinct criteria based upon a bounding box drawn around each resulting
connected component; examples of which are shown in Fig. 14 based on the image of Fig.
10E. The criteria are:

1. The area of the bounding box compared to a threshold. This recognizes the fact that a face
will generally not be very large or very small.

2. The aspect ratio (height compared to the width) of the bounding box compared to a
threshold. This recognizes that human faces generally fall into a range of aspect ratios.

3. The ratio of the area of detected skin colored pixels to the area of the bounding box
compared to a threshold. This criterion recognizes the fact that the area covered by a human
face will fall into a range of percentages of the area of the bounding box.

4. The orientation of elongated objects within the bounding box. There are many known
ways of determining the orientation of a series of pixels. For example, the medial axis can
be determined and the orientation can be found from that axis. In general, faces are not
rotated significantly about the axis ("z-axis") which is perpendicular to the plane having
the image and so components with elongated objects that are rotated with respect to the
z-axis are removed from consideration.

5. The distance between the center of the bounding box and the center of mass of the
component being examined. Generally, faces are located within the center of the
bounding box and will not, for example, be located all to one side.
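
A rough sketch of criteria 1, 2, 3 and 5 follows. Every threshold value here is a placeholder assumption chosen only so the example runs, since the specification gives no numbers, and criterion 4 (orientation) is left out to keep the sketch short.

```python
def bounding_box(pixels):
    """Return (top, left, bottom, right) for a list of (row, col) pixels."""
    rows = [r for r, _ in pixels]
    cols = [c for _, c in pixels]
    return min(rows), min(cols), max(rows), max(cols)

def could_be_face(pixels, area_range=(4, 10000), aspect_range=(0.8, 2.5),
                  fill_range=(0.4, 1.0), max_offset=0.25):
    """Apply criteria 1, 2, 3 and 5 to one connected component.

    All thresholds are hypothetical; criterion 4 (orientation) is omitted."""
    top, left, bottom, right = bounding_box(pixels)
    height, width = bottom - top + 1, right - left + 1
    area = height * width
    if not area_range[0] <= area <= area_range[1]:              # criterion 1
        return False
    if not aspect_range[0] <= height / width <= aspect_range[1]:  # criterion 2
        return False
    if not fill_range[0] <= len(pixels) / area <= fill_range[1]:  # criterion 3
        return False
    # Criterion 5: centre of mass near the box centre, relative to box size.
    com_r = sum(r for r, _ in pixels) / len(pixels)
    com_c = sum(c for _, c in pixels) / len(pixels)
    box_r, box_c = (top + bottom) / 2, (left + right) / 2
    return (abs(com_r - box_r) <= max_offset * height
            and abs(com_c - box_c) <= max_offset * width)
```

A filled 4x3 block passes these placeholder thresholds, while a one-pixel-high strip ten pixels wide fails the aspect ratio test.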

The iterations for variance are continued, thereby breaking down the image into
smaller components until the size of the components is below a threshold. The images of Figs.
10C and 10E are shown transformed in Figs. 13A and 13B respectively after the variance
iteration process. As can be discerned, faces in the image were separated from the non-facial
skin colored areas in the background as a result of the variance iteration. Frequently, this
causes the area with detected skin color to be fragmented as is exemplified in Fig. 13B. This
occurs because either there are objects occluding portions of the face (like eyeglasses or facial
hair) or because portions were removed due to high variance. It would thus be difficult to look
for a face using the resulting components by themselves. The components that still can be part
of a face after the variance iteration and component classification steps are connected to form a
graph as shown in Fig. 15. In this way, skin colored components that have similar features,
and are close in space, are grouped together and then further examined.

Referring to Fig. 15, each resulting component (that survives the color
detecting, edge removal, and component classification steps) is represented by a vertex of a
graph. Vertices are connected if they are close in space in the original image and if they have a
similar color in the original image. Two components, i and j, have a similar color if:

$$|Y_i - Y_j| < t_y \ \text{AND}\ |U_i - U_j| < t_u \ \text{AND}\ |V_i - V_j| < t_v$$

where $Y_n$, $U_n$, and $V_n$ are the average values of the luminance and chrominance of the nth
component and $t_n$ are threshold values. The thresholds are based upon variations in the Y, U,
and V values in faces and are kept high enough so that components of the same face will be
considered similar. Components are considered close in space if the distance between them is
less than a threshold. The spatial requirement ensures that spatially distant components are not
grouped together because portions of a face would not normally be located in spatially distant
portions of an image.
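
The two grouping conditions can be sketched as an edge-building routine. The threshold values, the dictionary layout, and the use of component centers to measure distance are assumptions made for illustration; the specification only states that thresholds exist.

```python
import math

def similar_color(a, b, t_y=30, t_u=15, t_v=15):
    """Colour test on average Y, U, V values; thresholds are placeholders."""
    return (abs(a["Y"] - b["Y"]) < t_y
            and abs(a["U"] - b["U"]) < t_u
            and abs(a["V"] - b["V"]) < t_v)

def build_edges(components, max_distance=50.0):
    """Connect components that are both close in space and similar in colour.

    components: list of dicts with average 'Y', 'U', 'V' and a 'center' (x, y).
    Returns a list of (distance, i, j) tuples, one per connection."""
    edges = []
    for i in range(len(components)):
        for j in range(i + 1, len(components)):
            dist = math.dist(components[i]["center"], components[j]["center"])
            if dist < max_distance and similar_color(components[i], components[j]):
                edges.append((dist, i, j))
    return edges
```

Storing the distance with each connection is convenient because later steps can use it directly as a weight.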

The connection between vertices is called an edge. Each edge is given a weight
which is proportional to the Euclidean distance between the two vertices. Connecting the
vertices together will result in a graph or a set of disjointed graphs. For each of the resulting
graphs, the minimum spanning tree is extracted. The minimum spanning tree is generally
defined as the subset of a graph where all of the vertices are still connected and the sum of the
lengths of the edges of the graph is as small as possible (minimum weight). The components
corresponding to each resulting graph are then classified as either face or not face using the
shape parameters defined in the component classification step mentioned above. Then each
graph is split into two graphs by removing the weakest edge (the edge with the greatest
weight) and the corresponding components of the resulting graphs are examined again. The
division continues until an area of a bounding box formed around the resultant graphs is
smaller than a threshold.
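
The extraction and splitting steps can be sketched with Kruskal's algorithm, which is one standard way to obtain a minimum spanning tree; the specification does not name an algorithm, so this choice and the helper names are assumptions.

```python
def minimum_spanning_tree(num_vertices, edges):
    """Kruskal's algorithm. edges: iterable of (weight, i, j) tuples.
    Returns the edges kept in the tree (a forest if the graph is disjoint)."""
    parent = list(range(num_vertices))

    def find(v):
        while parent[v] != v:
            parent[v] = parent[parent[v]]  # path halving
            v = parent[v]
        return v

    tree = []
    for weight, i, j in sorted(edges):
        root_i, root_j = find(i), find(j)
        if root_i != root_j:
            parent[root_i] = root_j
            tree.append((weight, i, j))
    return tree

def split_at_weakest_edge(tree):
    """Remove the heaviest edge of a non-empty connected tree and return
    the two resulting vertex sets."""
    weakest = max(tree)
    remaining = [e for e in tree if e is not weakest]
    vertices = {v for _, i, j in tree for v in (i, j)}
    side = {weakest[1]}
    changed = True
    while changed:  # grow one side along the remaining edges
        changed = False
        for _, i, j in remaining:
            if (i in side) != (j in side):
                side.update((i, j))
                changed = True
    return side, vertices - side
```

Each of the two vertex sets would then be classified again with the bounding box criteria, and the split repeated until the bounding box area falls below the chosen threshold.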

By breaking down and examining each graph for a face, a set of all the possible
locations and sizes of faces in an image is determined. This set may contain a large number of
false positives and so a heuristic is applied to remove some of the false positives. Looking for
all the facial features (i.e. nose, mouth, etc.) would require a template which would yield too
large of a search space. However, experimentation has shown that those facial features have
edges with a high variance. Many false positives can be removed by examining the ratio of
high variance pixels inside a potential face to the overall number of pixels in the potential face.

The aforementioned heuristic is effectuated by first applying a morphological 
closing operation to the facial candidates within the image. As is known in the art, a mask is 
chosen and applied to each pixel within a potential facial area. For example, a 3x3 mask could 
be used. A dilation algorithm is applied to expand the borders of face candidate components. 
Then an erosion algorithm is used to eliminate pixels from the borders. One with ordinary skill 
in the art will appreciate that these two algorithms, performed in this order, will fill in gaps 
between components and will also keep the components at substantially the same scale. 
Clearly, one could perform multiple dilation and then multiple erosion steps as long as 
both are applied an equal number of times. 
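
A morphological closing with a 3x3 mask can be sketched in plain Python as shown below. The sketch assumes a binary image stored as a list of rows and treats pixels outside the image as background, a border convention the text does not specify. The names dilate, erode, and close_mask are hypothetical.

```python
def _filter_3x3(image, reduce_fn):
    """Apply reduce_fn (max or min) over each pixel's 3x3 neighbourhood."""
    height, width = len(image), len(image[0])

    def pixel(r, c):
        # Assumption: pixels outside the image count as background (0).
        return image[r][c] if 0 <= r < height and 0 <= c < width else 0

    return [[reduce_fn(pixel(r + dr, c + dc)
                       for dr in (-1, 0, 1) for dc in (-1, 0, 1))
             for c in range(width)]
            for r in range(height)]


def dilate(image):
    # A pixel is set if any pixel under the mask is set.
    return _filter_3x3(image, max)


def erode(image):
    # A pixel stays set only if every pixel under the mask is set.
    return _filter_3x3(image, min)


def close_mask(image, passes=1):
    # Equal numbers of dilation and erosion passes, dilation first.
    for _ in range(passes):
        image = dilate(image)
    for _ in range(passes):
        image = erode(image)
    return image


candidate = [
    [0, 0, 0, 0, 0, 0, 0],
    [0, 1, 1, 0, 1, 1, 0],
    [0, 1, 1, 0, 1, 1, 0],
    [0, 0, 0, 0, 0, 0, 0],
]
closed = close_mask(candidate)
for row in closed:
    print(row)
```

In this example the one-pixel gap between the two components is filled while the outer extent of the region is unchanged.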

Now, the ratio of pixels with a high variance neighborhood inside the face 
candidate area is compared to the total number of pixels in the face candidate area. Referring 
to Figs. 16A to 16D, an original image in Fig. 16A is examined for potential face candidates 
using the methods described above to achieve the binary image shown in Fig. 16B. The 
morphological closing operation is performed on the binary image resulting in the image 
shown in Fig. 16C. Finally, pixels with high variance located in the image of Fig. 16C are 
detected as is shown in Fig. 16D. The ratio of the high variance pixels to the total number of 
pixels can then be determined. The entire participant detection method is summarized by steps 
S2-S16 shown in Fig. 17. 
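
The ratio of high variance pixels to total pixels can be computed along the following lines. This is a sketch under assumptions: variance is taken over a 3x3 luminance neighborhood clipped at the image border, and the variance threshold is an arbitrary example value. The function names are hypothetical, and the comparison of the resulting ratio against a threshold is left to the caller.

```python
from statistics import pvariance


def neighbourhood_variance(luminance, r, c):
    """Population variance of the 3x3 window around (r, c), clipped at borders."""
    height, width = len(luminance), len(luminance[0])
    window = [luminance[rr][cc]
              for rr in range(max(0, r - 1), min(height, r + 2))
              for cc in range(max(0, c - 1), min(width, c + 2))]
    return pvariance(window)


def high_variance_ratio(luminance, candidate_mask, variance_threshold):
    # Pixels belonging to the (closed) face candidate area.
    inside = [(r, c)
              for r, row in enumerate(candidate_mask)
              for c, inside_flag in enumerate(row) if inside_flag]
    if not inside:
        return 0.0
    # Count candidate pixels whose neighbourhood variance exceeds the threshold.
    high = sum(1 for r, c in inside
               if neighbourhood_variance(luminance, r, c) > variance_threshold)
    return high / len(inside)


luminance = [
    [100, 100, 100, 100],
    [100, 200, 100, 100],
    [100, 100, 100, 100],
    [100, 100, 100, 100],
]
candidate_mask = [[1] * 4 for _ in range(4)]
ratio = high_variance_ratio(luminance, candidate_mask, variance_threshold=500)
print(ratio)
```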


As can be discerned, by controlling a camera to view a space defined by a video 
conferencing system, camera parameter presets corresponding to participants can be 
calculated automatically and updated continuously. 

Having described the preferred embodiments, it should be apparent that 
various changes could be made without departing from the scope and spirit of the invention, 
which is defined more clearly in the appended claims. 

1. A method of calculating presets of camera parameters corresponding to 
participants (Part A, Part B, Part C) in a video conferencing system, said method comprising: 

- providing a camera having tilt, pan, and zoom parameters (50); 

- defining a space based upon a layout of said video conferencing system; 

- performing one of 

moving said camera through all pertinent panning values, said pertinent 
panning values being defined by said space in which said video conferencing system is 
located, and 

zooming said camera out so that all possible participants can be viewed by said 
camera and so that a location of each participant in said space can be determined; 

- detecting said participants within said space; and 

- calculating said presets corresponding to said participants, said presets defining a camera 
view, said presets being based upon at least one of an optimal position of said participants 
in said camera view, an alignment of the center of a head of said participants with a center 
of said camera view, and an alignment of a center of a participant with said center of said 
camera view. 


2. The method as claimed in claim 1 further comprising tracking said participants 
by associating a label with each of said participants. 

3. The method as claimed in claim 1 further comprising updating said presets by 
having said video conference system perform at least one of adjusting a preset when that 
preset is chosen by a user, deleting a preset when the participant corresponding to the preset 
leaves said space, and repeating said performing. 

4. The method as claimed in claim 1 wherein, in said calculating step, when more 
than one participant is within said camera view, the participants are combined into one 
combined image and the center of the combined image is used to determine said presets. 

5. The method as claimed in claim 1 wherein said step of detecting comprises: 

- providing a digital image composed of a plurality of pixels (S2); 

- producing a binary image from the digital image by detecting skin colored pixels (S4); 

- removing pixels corresponding to edges in the luminance component of said binary image 
thereby producing binary image components (S6); 

- mapping said binary image components into at least one graph (S12); and 

- classifying said mapped binary image components as facial and non-facial types wherein 
the facial types serve as facial candidates (S14). 

6. The method as claimed in claim 5 further comprising the step of applying a 
heuristic, said heuristic including the following steps: 

- applying a morphological closing operation on each of said facial candidates to produce at 
least one closed facial candidate; 

- determining high variance pixels in said closed facial candidate; 

- determining the ratio between said high variance pixels and the total number of pixels in 
said closed facial candidate; and 

- comparing said ratio to a threshold. 


7. The method as claimed in claim 5 wherein said step of removing includes: 

- applying a mask to a plurality of pixels including an examined pixel; 

- determining the variance between said examined pixel and pixels disposed within said 
mask; and 

- comparing said variance to a variance threshold. 


8. The method as claimed in claim 7 wherein: 

- said step of removing is repeated for decreasing variance thresholds until a size of said 
binary image components is below a component size threshold; and 

- after each step of removing, said step of classifying said components is performed. 


9. The method as claimed in claim 5 wherein said binary image components are 
connected. 


WO 00/38414    PCT/EP99/10066 

10. The method as claimed in claim 5 wherein said step of classifying comprises 
forming a bounding box around a classified component of said components and performing at 
least one of: 

- forming a bounding box around a classified component of said components; 

- comparing an area of the bounding box to a bounding box threshold; 

- comparing an aspect ratio of the bounding box to an aspect ratio threshold; 

- determining an area ratio, said area ratio being the comparison between the area of said 
classified component and the area of said bounding box, and comparing said area ratio to 
an area ratio threshold; 

- determining an orientation of elongated objects within said bounding box; and 

- determining a distance between a center of said bounding box and a center of said 
classified component. 

11. The method as claimed in claim 5 wherein said step of mapping comprises the following 
steps: 

- representing each component as a vertex; 

- connecting vertices with an edge when close in space and similar in color, thereby forming 
said at least one graph. 


12. The method as claimed in claim 11 wherein each edge has an associated weight 
and further comprising the steps of: 

- extracting the minimum spanning tree of each graph; 

- classifying the corresponding binary image components of each graph as either face or not 
face; 

- removing the edge in each graph with the greatest weight thereby forming two smaller 
graphs; and 

- repeating said step of classifying the corresponding binary image components for each of 
said smaller graphs until a bounding box around said smaller graphs is smaller than a graph 
threshold. 


13. The method as claimed in claim 1 further comprising: providing at least a 
second camera for updating said presets by executing said performing. 


14. A video conferencing system comprising: 

- at least one camera having pan, tilt, and zoom parameters (50); 

- said parameters having preset values assigned to corresponding participants (Part A, Part 
B, Part C) of said video conferencing system; 

- each of said presets defining a camera view and being determined by: 

one of panning and zooming said camera throughout a space defined by said 
video conferencing system, 

detecting a participant, and 

defining a preset based on a camera position which would place said participant 
in one of an optimal position, a position where a head of said participant is in alignment with a 
center of said camera's view, and a position where a center of said participant is aligned with 
said center of said camera's view. 


15. The video conferencing system as claimed in claim 14 further comprising 
means for tracking said participants by associating a label with each of said participants. 


16. The video conferencing system as claimed in claim 14 further comprising 
means for updating said presets by having said video conference system perform at least one 
of adjusting a preset when that preset is chosen by a user, deleting a preset when the 
participant corresponding to the preset leaves said space, panning said camera through said 
space, and zooming said camera through said space. 

17. The video conferencing system as claimed in claim 14 wherein when more than 
one participant is within said camera view, the participants are combined into one combined 
image and the center of the combined image is used to determine said presets. 


18. The video conferencing system as claimed in claim 14 wherein said detecting 
comprises: 

- providing a digital image composed of a plurality of pixels (S2); 

- producing a binary image from the digital image by detecting skin colored pixels (S4); 

- removing pixels corresponding to edges in the luminance component of said binary image 
thereby producing binary image components (S6); 

- mapping said binary image components into at least one graph (S12); and 

- classifying said mapped binary image components as facial and non-facial types wherein 
the facial types serve as facial candidates (S14). 

19. The video conferencing system as claimed in claim 14 further comprising at 
least a second camera for updating said presets by performing at least one of panning said 
camera through said space, and zooming said camera through said space. 

20. A video conferencing system comprising: 

- at least one camera having pan, tilt, and zoom parameters (50); 

- said parameters having preset values assigned to corresponding participants of said video 
conferencing system, said presets defining a camera view; 

- at least one of panning means for panning said camera throughout a space defined by said 
video conferencing system, and zooming means for zooming said camera out to thereby 
allow said camera to view the space defined by said video conferencing system; 

- detecting means for detecting participants in said space; and 

- determination means for determining presets of said camera based on a camera position 
which would place one of said participants in one of an optimal position, a position where 
a head of said participant is in alignment with a center of said camera's view, and a position 
where a center of said participant is aligned with said center of said camera's view. 

21. The video conferencing system as claimed in claim 20 further comprising 
means for tracking said participants by associating a label with each of said participants. 

22. The video conferencing system as claimed in claim 20 further comprising 
means for updating said presets by having said video conference system perform at least one 
of adjusting a preset when that preset is chosen by a user, deleting a preset when the 
participant corresponding to the preset leaves said space, panning said camera throughout said 
space, and zooming said camera throughout said space. 

23. The video conferencing system as claimed in claim 20 wherein when more than 
one participant is within said camera view, the participants are combined into one combined 
image and the center of the combined image is used to determine said presets. 

24. The video conferencing system as claimed in claim 20 wherein said detecting 
comprises: 


- providing a digital image composed of a plurality of pixels (S2); 

- producing a binary image from the digital image by detecting skin colored pixels (S4); 

- removing pixels corresponding to edges in the luminance component of said binary image 
thereby producing binary image components (S6); 

- mapping said binary image components into at least one graph (S12); and 

- classifying said mapped binary image components as facial and non-facial types wherein 
the facial types serve as facial candidates (S14). 

25. The video conferencing system as claimed in claim 20 further comprising at 
least a second camera for updating said presets by performing at least one of panning said 
camera throughout said space and zooming said camera throughout said space. 